Abstract

The Consensus Clustering problem has been introduced as an effective
way to analyze the results of different microarray experiments [5, 6].
The problem consists of looking for a partition that best summarizes
a set of input partitions (each corresponding to a different microarray
experiment) under a simple and intuitive cost function. The
problem admits polynomial time algorithms on two input partitions, 
but is APX-hard on three input partitions. We investigate the restric- 
tion of Consensus Clustering when the output partition is required to 
contain at most k sets, giving a polynomial time approximation scheme 
(PTAS) while proving the NP-hardness of this restriction. 

1 Introduction 

Microarray data analysis is a fundamental task in studying genes. Indeed, 
microarray experiments provide measures of gene expression levels under 
certain experimental conditions, showing that groups of genes have a simi- 
lar behavior under certain conditions. However, even slightly different ex- 
perimental conditions may result in significantly different expression data. 
These gene expression patterns are useful to understand the relations among 
genes and could provide information useful for the construction of genetic 
networks. Nowadays the use of microarrays has become widespread and
sufficiently cheap to justify running a large battery of experiments under
similar, albeit not identical, conditions. The integration of the results is 
therefore the final computational step needed to obtain a meaningful inter- 
pretation of the data. 

In [5, 6] a clustering approach to the integration of data from different
microarray experiments was introduced. In the proposed approach,
called Consensus Clustering, the genes are represented by elements of a
universe set. The data obtained under a given experimental condition
are represented as a partition of the universe set, where each set contains
elements (genes) that have similar expression levels in the experiment. The
approach then computes the consensus of the partitions given by
a collection of gene expression data, since integrating different experimental
data is potentially more informative than the individual experimental data.
More precisely, Consensus Clustering asks for a partition of the universe
set that best summarizes a set of input partitions on the same universe.
The Consensus Clustering problem has been studied extensively in the
literature and its NP-hardness over general instances is well known [9, 11].

The minimization version of Consensus Clustering, called Minimum
Consensus Clustering, admits a 4/3-approximation algorithm [1] as well
as a number of heuristics based on cutting-plane [8] and simulated
annealing [6]. In the latter paper, it was observed that the problem is
trivially solvable for instances of at most two partitions, while an open
question, as recently recalled in [1], was the computational complexity of
the problem (for both minimization and maximization versions) on k input
partitions, for any constant k > 2. The question has been settled in [3] by
showing that Minimum Consensus Clustering is APX-hard even on instances
with three input partitions, which rules out both an exact polynomial time
algorithm and a PTAS unless P = NP. In this paper we will focus on the
restriction of the problem where the desired consensus partition has at
most k sets, with k a constant.

A problem closely related to Minimum Consensus Clustering is Minimum
Correlation Clustering. In Minimum Correlation Clustering, given a
complete graph where each edge is associated with a label in $\{+, -\}$,
the goal is to compute a partition of the vertices of the graph so that
the number of pairs of co-clustered vertices joined by $-$ edges plus the
number of pairs of vertices joined by $+$ edges and not co-clustered is
minimized. The restriction of Minimum Correlation Clustering where the
output partition has at most k sets is NP-hard but admits a PTAS [7]. We
will extend the analysis of [7] by showing that the analogous restriction
of Minimum Consensus Clustering admits a PTAS, while being NP-hard.

Notice that Minimum Correlation Clustering and Minimum Consensus
Clustering are not comparable, since the input graph in Minimum
Correlation Clustering is unweighted, while the input graph of Minimum
Consensus Clustering is weighted. On the other hand, it is immediate to
notice that there are unweighted graphs that are not an instance of
Minimum Consensus Clustering.

2 The problem 

We will tackle the Consensus Clustering problem, in its minimization
version. Two elements of the universe set are co-clustered in a partition
$\pi$ if they belong to the same set of $\pi$.

Definition 2.1. Let $V$ be a universe set and let $\pi_1, \pi_2$ be two
partitions of $V$. Let $d(\pi_1, \pi_2)$ denote the symmetric difference
distance, defined as the number of pairs of elements co-clustered in
exactly one of $\pi_1$ and $\pi_2$. Let $s(\pi_1, \pi_2)$ denote the
similarity measure, defined as the number of pairs of elements co-clustered
in both partitions plus the number of pairs of elements co-clustered in
neither $\pi_1$ nor $\pi_2$.

Given two elements $i, j$ of the universe set $V$ and a set
$\Pi = \{\pi_1, \ldots, \pi_l\}$ of partitions of $V$, we denote by
$s_\Pi(i,j)$ (or simply $s(i,j)$ whenever $\Pi$ is known from the context)
and by $d_\Pi(i,j)$ (or simply $d(i,j)$), respectively, the number of
partitions of $\Pi$ in which $i, j$ are co-clustered and are not
co-clustered. Clearly, for each pair, $d_\Pi(i,j) + s_\Pi(i,j) = l$, that
is the number of partitions. When $\Pi$ consists of 2 partitions $\pi_1$
and $\pi_2$, we denote by $d(\pi_1, \pi_2)$ the quantity
$\sum_{i<j} d_{\{\pi_1, \pi_2\}}(i,j)$.
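As an illustration of the quantities just defined, the following minimal Python sketch computes $s_\Pi$, $d_\Pi$ and the symmetric difference distance on a toy instance. The function names and the example partitions are ours and are not part of the original formulation; partitions are assumed to be given as lists of disjoint sets covering the same universe.

```python
from itertools import combinations


def cluster_index(partition):
    """Map each element to the index of the set that contains it."""
    return {x: idx for idx, block in enumerate(partition) for x in block}


def pair_counts(partitions):
    """Return (s, d): for each pair (i, j), s counts the input partitions
    that co-cluster i and j, while d counts those that separate them."""
    labels = [cluster_index(p) for p in partitions]
    universe = sorted(labels[0])
    s, d = {}, {}
    for i, j in combinations(universe, 2):
        together = sum(1 for lab in labels if lab[i] == lab[j])
        s[(i, j)] = together
        d[(i, j)] = len(partitions) - together
    return s, d


def symmetric_difference_distance(p1, p2):
    """Number of pairs co-clustered in exactly one of the two partitions."""
    l1, l2 = cluster_index(p1), cluster_index(p2)
    return sum(
        1
        for i, j in combinations(sorted(l1), 2)
        if (l1[i] == l1[j]) != (l2[i] == l2[j])
    )


# Toy instance (hypothetical data): three partitions of {1, 2, 3, 4}.
pi_1 = [{1, 2}, {3, 4}]
pi_2 = [{1, 2, 3}, {4}]
pi_3 = [{1}, {2, 3, 4}]
s, d = pair_counts([pi_1, pi_2, pi_3])
```

On this instance every pair satisfies $s(i,j) + d(i,j) = 3$, as expected with three input partitions.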

We are now able to formally introduce the problem we will study in
this paper, Minimum Consensus Clustering when the output partition
is required to have at most $k$ sets (denoted by $k$-Min-CC): we are given
a set $\Pi = \{\pi_1, \pi_2, \ldots, \pi_l\}$ of partitions over universe
$V$ and we want to find a partition $\pi$ of $V$, such that $\pi$ has at
most $k$ sets and $\pi$ minimizes
$d(\pi, \Pi) = \sum_{i=1}^{l} d(\pi, \pi_i)$, that is the cost of solution
$\pi$. In what follows, we denote by $k$-Min-CC($l$) the restriction of the
$k$-Min-CC problem where the input consists of exactly $l$ partitions of $V$.
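The cost of a candidate solution can be evaluated directly from this definition. The sketch below is only illustrative: the helper names (`co_clustered`, `distance`, `consensus_cost`) and the toy data are assumptions of ours, and the check on the number of sets simply enforces the "at most $k$ sets" requirement.

```python
from itertools import combinations


def co_clustered(partition, i, j):
    """True if i and j belong to the same set of the partition."""
    return any(i in block and j in block for block in partition)


def distance(p1, p2, universe):
    """Symmetric difference distance d(p1, p2) over pairs of the universe."""
    return sum(
        1
        for i, j in combinations(universe, 2)
        if co_clustered(p1, i, j) != co_clustered(p2, i, j)
    )


def consensus_cost(candidate, partitions, universe, k):
    """Cost d(candidate, Pi); candidates with more than k sets are rejected."""
    if len(candidate) > k:
        raise ValueError("candidate has more than k sets")
    return sum(distance(candidate, p, universe) for p in partitions)


# Toy instance (hypothetical data).
universe = [1, 2, 3, 4]
inputs = [[{1, 2}, {3, 4}], [{1, 2, 3}, {4}], [{1}, {2, 3, 4}]]
cost_two_sets = consensus_cost([{1, 2}, {3, 4}], inputs, universe, k=2)
cost_one_set = consensus_cost([{1, 2, 3, 4}], inputs, universe, k=2)
```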

The Minimum Consensus Clustering problem is closely related to Minimum
Correlation Clustering [2], where we are given a labeled complete
graph, with each edge labeled by either $+$ or $-$, and the goal is to
compute a partition $C_1, C_2, \ldots, C_k$ of the vertex set so that the
number of $+$ edges cut by the partition plus the number of $-$ edges
inside a same set $C_i$ is minimized. Several variants of correlation
clustering have been introduced

Ismail]. 
An instance of Minimum Consensus Clustering can be represented
with a labeled complete graph $G = (V, E)$, where each edge $(v, w) \in E$
is labeled by $s_\Pi(v, w)$. In Section 3 we assume that the instance of
$k$-Min-CC($l$) is precisely this graph representation of Minimum Consensus
Clustering.

3 The PTAS 

In this section we will show that $k$-Min-CC admits a PTAS, that is, for
any $\varepsilon > 0$, a polynomial time approximation algorithm with a
guaranteed $1 + \varepsilon$ ratio between the costs of the approximate
solution and the optimal solution. Let $G = (V, E)$ be the complete graph
instance of $k$-Min-CC.

The MinDisAg algorithm of [7] for Minimum Correlation Clustering
can be restated to solve $k$-Min-CC and is reported here as Algorithm 1. Let
us detail the idea behind MinDisAg [7] and how it can be generalized. First
of all, some "small" instances are solved by a brute force approach, namely
when only one set must be computed or when the number $n$ of input
elements is polynomial in $k$ (the number of desired output sets). In fact,
there are at most $k^n$ possible partitions of $V$, and $k^n$ is a constant
whenever $n$ is polynomial in $k$.
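The brute force step can be sketched as an enumeration of all $k^n$ assignments of elements to set indices. This is a minimal illustration under our own naming (`cost_of_labeling`, `brute_force_k_min_cc`) and toy data; it is meant only for tiny instances and is not the implementation used in [7].

```python
from itertools import combinations, product


def cost_of_labeling(labels, partitions):
    """Sum, over the input partitions, of the symmetric difference distance
    between each input partition and the partition induced by `labels`."""
    input_labels = [
        {x: b for b, block in enumerate(p) for x in block} for p in partitions
    ]
    total = 0
    for i, j in combinations(sorted(labels), 2):
        together = labels[i] == labels[j]
        total += sum(1 for lab in input_labels if together != (lab[i] == lab[j]))
    return total


def brute_force_k_min_cc(universe, partitions, k):
    """Try all k^n labelings and return (best cost, best labeling)."""
    best_cost, best_labels = None, None
    for assignment in product(range(k), repeat=len(universe)):
        labels = dict(zip(universe, assignment))
        cost = cost_of_labeling(labels, partitions)
        if best_cost is None or cost < best_cost:
            best_cost, best_labels = cost, labels
    return best_cost, best_labels


# Toy instance (hypothetical data).
universe = [1, 2, 3, 4]
inputs = [[{1, 2}, {3, 4}], [{1, 2, 3}, {4}], [{1}, {2, 3, 4}]]
best_cost, best_labels = brute_force_k_min_cc(universe, inputs, 2)
```

Several labelings describe the same partition (they differ only by a renaming of the set indices), which is harmless here since only an upper bound on the number of candidates is needed.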

The algorithm starts by randomly sampling a subset $S$ of $V$. If the
sample is not too large (i.e. $O(\log n)$), then it is possible to compute
all partitions of $S$ into at most $k$ sets in polynomial time, since there
are at most $k^{|S|} = n^{O(\log k)}$ of them and $k$ is a constant. Since
the steps that the algorithm performs for each partition require polynomial
time, the whole algorithm has polynomial time complexity.

The algorithm extends each partition of $S$ to a partition of $V$. Since
every partition of $S$ is tried and their number is polynomial, we can
restrict our attention to the partition $\mathcal{S}$ that fully agrees on
$S$ with the overall optimal solution $\mathcal{D}$. On that specific
partition, extending $\mathcal{S}$ to a partition of $V$ introduces only a
few errors.

More precisely, the algorithm applies a greedy procedure to extend
$\mathcal{S}$: it assigns independently each element $x$ of $V \setminus S$
to the cluster of $\mathcal{S}$ that minimizes the total cost of all pairs
made of $x$ and an element of $S$.
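The greedy extension just described can be sketched as follows. We assume, as in Section 2, that a pair co-clustered in the solution pays $d(x,y)$ and a separated pair pays $s(x,y)$; the function name `greedy_extend` and the toy data are ours.

```python
def greedy_extend(sample_clusters, universe, partitions):
    """Extend a partition of the sample to the whole universe: each element
    outside the sample joins the sample cluster that minimizes the total
    cost of the pairs it forms with the sampled elements."""
    labels = [
        {x: b for b, block in enumerate(p) for x in block} for p in partitions
    ]
    sampled = set().union(*sample_clusters)
    clusters = [set(c) for c in sample_clusters]

    def placement_cost(x, target):
        total = 0
        for b, block in enumerate(sample_clusters):
            for y in block:
                s = sum(1 for lab in labels if lab[x] == lab[y])
                d = len(labels) - s
                # co-clustered pairs pay d(x, y), separated pairs pay s(x, y)
                total += d if b == target else s
        return total

    for x in universe:
        if x not in sampled:
            best = min(
                range(len(sample_clusters)),
                key=lambda target: placement_cost(x, target),
            )
            clusters[best].add(x)
    return clusters


# Toy instance (hypothetical data): three identical input partitions.
inputs = [[{1, 2, 3}, {4, 5, 6}] for _ in range(3)]
extended = greedy_extend([{1}, {4}], range(1, 7), inputs)
```

Each element is placed independently of the others, so the loop can be run in any order without changing the result (ties are broken by the lowest cluster index in this sketch).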

This procedure computes a clustering of $V$ into sets that are classified
as large or small, depending on whether their size is above or below a
certain threshold. The large sets are retained, while all small sets are
merged together, obtaining a new universe set which is in turn recursively
fed to the algorithm (this time requiring a smaller error ratio and
obtaining a partition with fewer sets).
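The bookkeeping of this step can be sketched as below. The threshold is left as a parameter, and the choice of treating a set of size exactly equal to the threshold as large is an assumption of ours; `split_large_small` and `restrict` are hypothetical helper names.

```python
def split_large_small(clusters, threshold):
    """Separate large clusters from small ones and merge the small ones
    into the universe of the recursive call."""
    large = [c for c in clusters if len(c) >= threshold]
    small = [c for c in clusters if len(c) < threshold]
    merged = set().union(*small)
    return large, small, merged


def restrict(partition, new_universe):
    """Restriction of a partition to a smaller universe (empty sets dropped)."""
    return [block & new_universe for block in partition if block & new_universe]


# Toy data (hypothetical).
large, small, merged = split_large_small([{1, 2, 3, 4}, {5}, {6, 7}], 3)
restricted = restrict([{1, 5}, {2, 6, 7}, {3, 4}], merged)
```

The recursive call then receives the restricted input partitions, the number of small sets as the new bound on the number of output sets, and a smaller error parameter.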
Recall that $l$ denotes the number of input partitions, and $k$ denotes
the number of sets in the output partition. Given a partition $P$ of $V$,
the cost of $P$ is denoted by $cost(P)$. Let $\varepsilon'$ be equal to
128 | Q2fc 4 (i.e. $\varepsilon'$ is a constant depending only on
$\varepsilon$ and the number of sets in the output partition). We
distinguish two cases: the optimum is at most $\varepsilon' n^2$ or at
least $\varepsilon' n^2$. In the latter case we exploit the fact that it is
possible to solve the problem in polynomial time and with a guaranteed
additive error $\varepsilon \varepsilon' n^2$, where $n$ is the number of
elements in the universe, for any constant $\varepsilon > 0$ (see [3] for
details). Then the approximation ratio is at most
$\frac{\varepsilon' n^2 + \varepsilon \varepsilon' n^2}{\varepsilon' n^2} = 1 + \varepsilon$,
that is the algorithm in [3] computes the required approximate solution.
Therefore, in the following we only have to investigate the case when the
optimal solution has a cost at most $\varepsilon' n^2$.
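The bound on the ratio in the high-cost case can be spelled out in one line. Here $A$ denotes the cost of the solution computed by the algorithm of [3] and $OPT$ the optimal cost; both symbols are our own shorthand, and we assume $OPT \ge \varepsilon' n^2$ and $A \le OPT + \varepsilon \varepsilon' n^2$.

```latex
\[
\frac{A}{OPT}
\;\le\; \frac{OPT + \varepsilon\varepsilon' n^2}{OPT}
\;=\; 1 + \frac{\varepsilon\varepsilon' n^2}{OPT}
\;\le\; 1 + \frac{\varepsilon\varepsilon' n^2}{\varepsilon' n^2}
\;=\; 1 + \varepsilon .
\]
```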

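We define $t$ = 256 °o° ofc4 log n as the size of the sample set $S$, and
$\mathcal{D} = \{D_1, \ldots, D_k\}$ as the optimal solution (whose cost is
denoted by $\gamma n^2$). Let $\mathcal{S}$ be the partition
$\{X \cap S : X \in \mathcal{D}\}$, that is the restriction of
$\mathcal{D}$ to the set $S$. We recall that we will mainly focus on the
iteration of steps 6-21 where such $\mathcal{S}$ is extended to a partition
of the universe set $V$. Let $\mathcal{A}$ be a partition of a set
$A \subseteq V$, and let $x$ be an element of $V$. Then
$N^{\mathcal{A}}(x)$ is the set of all elements of $A$ different from $x$
and co-clustered with $x$ in $\mathcal{A}$. Given an element $u \in V$,
define $val_i^{\mathcal{A}}(u)$ as

$$val_i^{\mathcal{A}}(u) = \frac{1}{|A \setminus \{u\}|} \left( |\{x \in N^{\mathcal{A}}(u) \wedge s(x,u) = i\}| + |\{x \notin N^{\mathcal{A}}(u) \wedge d(x,u) = i\}| \right),$$

where $x$ ranges over $A \setminus \{u\}$.

Informally $val_i^{\mathcal{A}}(u)$ is the fraction of pairs consisting of
$u$ and an element of $A$ that may give a contribution $l - i$ to the cost
of the solution. Moreover we define $val^{\mathcal{A}}(u)$ as

$$val^{\mathcal{A}}(u) = \frac{\sum_{x \in N^{\mathcal{A}}(u)} s(x,u) + \sum_{x \notin N^{\mathcal{A}}(u)} d(x,u)}{l \, |A \setminus \{u\}|}.$$

Informally $val^{\mathcal{A}}(u)$ is the fraction of input pairs containing
$u$ on which $\mathcal{A}$ agrees. Notice that
$val^{\mathcal{A}}(u) = \frac{1}{l} \sum_{i=1}^{l} i \cdot val_i^{\mathcal{A}}(u)$.
Let $\mathcal{A}$ be a partition of the set $A$, then $\mathcal{A}(u, i)$
is the partition obtained from $\mathcal{A}$ by moving the element $u$ to
the set $A_i$ (notice that $u$ may not belong to $A$). Given an integer
$j$, with $1 \le j \le l$, define
$pval^{\mathcal{A}}(u, i) = val^{\mathcal{A}(u,i)}(u)$ and
$pval_j^{\mathcal{A}}(u, i) = val_j^{\mathcal{A}(u,i)}(u)$. Finally we
introduce the notion of $\beta$-good partition, which is a good
approximation of the optimal partition. Let $A$ be a subset of $V$,
$\mathcal{A}$ be a partition of $A$ and $\beta$ = 12g ^ fc4. Then
$\mathcal{A}$ is $\beta$-good if for each $u \in V$, $0 \le j \le l$ and
$1 \le i \le k$,

$$|pval_j^{\mathcal{A}}(u, i) - pval_j^{\mathcal{D}}(u, i)| \le \beta.$$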
Algorithm 1: MinDisAg($k$, $\varepsilon$)
Input: A set $\Pi$ of partitions of $V$
Output: A $k$-clustering of the graph, i.e. a partition of $V$ into at
most $k$ sets $V_1, \ldots, V_k$

1 if $k = 1$ then
2     Return the obvious 1-clustering;
3 if $n \le 16k^2$ then
4     Return the optimal $k$-clustering, obtained by exhaustive search;
5 ClusMax $\leftarrow$ the result of the PTAS for Max Consensus Clustering [3]
  with accuracy $\varepsilon(\varepsilon, k)$;
6 Pick a sample $S \subseteq V$ by drawing $|S| = \frac{500 \log n}{\beta^2}$
  elements uniformly at random with replacement;
7 $m \leftarrow \infty$;
8 foreach partition $\mathcal{S}$ of $S$, $\mathcal{S} = \{S_1, \ldots, S_k\}$ do
9     Initialize the clusters $C_i \leftarrow S_i$ for $1 \le i \le k$;
10    foreach $u \in V \setminus S$ do
11        $j_u \leftarrow \arg\min_i \{cost((\mathcal{S} \setminus S_i) \cup (S_i \cup \{u\}))\}$;
          /* $j_u$ maximizes $pval^{\mathcal{S}}(u, j_u)$ */
          /* $val^{\mathcal{S}}(u) \leftarrow pval^{\mathcal{S}}(u, j_u)$ */
12        Add $u$ to the set $C_{j_u}$;
      /* Compute the set of large and small clusters */
13    Large $\leftarrow \{j \mid 1 \le j \le k,\ |C_j| \ge$ §: $\}$;
14    Small $\leftarrow \{1, \ldots, k\} \setminus$ Large;
15    $l \leftarrow |Large|$ and $s \leftarrow k - l = |Small|$;
16    $W \leftarrow \bigcup_{j \in Small} C_j$;
17    $\Pi' \leftarrow$ the restriction of the partitions in $\Pi$ to the new
      universe set $W$;
18    Recursively call MinDisAg on the partitions $\Pi'$ and with
      arguments $(s, \varepsilon/3)$. Denote by $W'_1, W'_2, \ldots, W'_s$ the result;
19    $\mathcal{C} \leftarrow \{C_1, \ldots, C_l, W'_1, \ldots, W'_s\}$;
20    if $cost(\mathcal{C}) < m$ then
21        $m \leftarrow cost(\mathcal{C})$;
22        ClusMin $\leftarrow \mathcal{C}$;
23 Return the better of the two clusterings ClusMax and ClusMin;
3.1 Analysis of the Algorithm

Notice that the main contribution of this section lies in Lemma 3.1, which
is a stronger version of a result in [7]; in that paper the notion of
$pval^{\mathcal{A}}(u, i)$ is sufficient because the problem studied is
unweighted. In our paper we study a problem where each pair of elements can
have a cost that is an integer between 0 and $l$, therefore we need a
definition of $pval_j^{\mathcal{A}}(u, i)$, with a new parameter $j$
expressing the number of input partitions where two elements are either
co-clustered or not co-clustered. Indeed our definition of $\beta$-goodness
requires that a certain inequality holds for values of $j$ that are
integers between 0 and $l$, while in [7] $j$ can, implicitly, only take 0
or 1 as value.

Recall that we denote by $\mathcal{S}$ the restriction of $\mathcal{D}$ to
the sample set $S$. The following lemma proves that $\mathcal{S}$ is, with
high probability, a good sample of the optimal solution.

Lemma 3.1. The partition $\mathcal{S}$ is $\beta$-good with probability at
least $1 - O(\frac{1}{\sqrt{n}})$.

Proof. Let $v$ be an element of $S$ and let $u$ be an element of $V$. Let
$p(v, i, j)$ be a variable equal to 1 if and only if
$v \in N^{\mathcal{D}(u,i)}(u)$ and $s(v, u) = j$, or
$v \notin N^{\mathcal{D}(u,i)}(u)$ and $d(v, u) = j$. Pose
$p(v, i, j) = 0$ otherwise.

By construction of $p(v, i, j)$ and $pval_j^{\mathcal{D}}(u, i)$, the
probability $Pr[p(v, i, j) = 1] = pval_j^{\mathcal{D}}(u, i)$, as the set
$S$ is sampled randomly from $V$. Also notice that

$$pval_j^{\mathcal{S}}(u, i) = val_j^{\mathcal{S}(u,i)}(u) = \frac{1}{|S \setminus \{u\}|} \left( |\{x \in N^{\mathcal{S}(u,i)}(u) \wedge s(x,u) = j\}| + |\{x \notin N^{\mathcal{S}(u,i)}(u) \wedge d(x,u) = j\}| \right) = \frac{1}{|S \setminus \{u\}|} \sum_{v \in S \setminus \{u\}} p(v, i, j),$$

as the latter equality is an immediate consequence of the definition of
$p(v, i, j)$.

The Hoeffding bound states that, given $m$ independent random variables
$X_1, \ldots, X_m$ such that $Pr[X_a = 1] = p$ (and $X_a = 0$ otherwise),
then $Pr[|p - \frac{1}{m} \sum_{a=1}^{m} X_a| > \beta] \le 2e^{-2m\beta^2}$.
In our case the random variables $X_a$ are the $p(v, i, j)$ and the sum is
over all elements $v \in S \setminus \{u\}$, therefore the inequality
becomes
$Pr[|pval_j^{\mathcal{D}}(u, i) - \frac{1}{|S \setminus \{u\}|} \sum_{v \in S \setminus \{u\}} p(v, i, j)| > \beta] \le 2e^{-2|S \setminus \{u\}|\beta^2} \le 2e^{-2t\beta^2}$.
By the previous arguments, the inequality can be rewritten as
$Pr[|pval_j^{\mathcal{D}}(u, i) - pval_j^{\mathcal{S}}(u, i)| > \beta] \le 2e^{-2t\beta^2}$,
which gives an upper bound on the probability that any element $u \in V$
does not satisfy the requirements of a $\beta$-good set.

Applying a union bound we obtain that the probability of having at least
one of the $t$ elements not satisfying the requirements is at most
$2te^{-2t\beta^2}$. Since $|S| = \frac{500 \log n}{\beta^2}$, the partition
$\mathcal{S}$ is $\beta$-good with probability at least

1 _ 2 5001ogn e -l0001o g „ = 1 _ 2 SOO^gWlogn ^ ? which j g krger 
than 1 j= for some constant c. □ 
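To get a feeling for how quickly the bound used in the proof decays, the following sketch evaluates the two-sided Hoeffding tail $2e^{-2m\beta^2}$ and the union bound over $t$ elements. The function names are ours, and the numeric values are purely illustrative rather than the constants of the analysis.

```python
import math


def hoeffding_tail(m, beta):
    """Two-sided Hoeffding bound 2 * exp(-2 * m * beta^2) for the mean of
    m independent 0/1 random variables deviating by more than beta."""
    return 2.0 * math.exp(-2.0 * m * beta ** 2)


def union_bound_failure(t, beta):
    """Union bound over t elements, each failing with probability at most
    hoeffding_tail(t, beta); the result is capped at 1."""
    return min(1.0, t * hoeffding_tail(t, beta))


# Illustrative values (hypothetical): a sample of 1000 elements, beta = 0.1.
single = hoeffding_tail(50, 0.1)
overall = union_bound_failure(1000, 0.1)
```

Since the exponent grows linearly in the sample size while the union bound only multiplies by $t$, a sample of size proportional to $\log n / \beta^2$ is enough to make the overall failure probability polynomially small in $n$.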
We will now provide some simple generalizations of the Lemmas in [7],
omitting the proofs as they are straightforward extensions of those in [7].
Just as in [7] we will assume that the sample $S$ is $\beta$-good, for some
constant $\beta$, and we will focus on the iteration of the algorithm for
the partition $\mathcal{S}$ of $S$ that agrees with the optimal partition
$\mathcal{D}$. We will denote by $C_1, \ldots, C_k$ the sets in ClusMin at
the end of such iteration.

Lemma 3.2 (Lemma 4.3 in [7]). Let $u \in V \setminus S$ with $u \in D_s$
(that is the $s$-th set of the optimal solution), and $u \in C_r$ for
$r \ne s$ (that is $u$ is misplaced by the algorithm). Then
$pval_j^{\mathcal{D}}(u, r) \ge pval_j^{\mathcal{D}}(u, s) - 2\beta = val_j^{\mathcal{D}}(u) - 2\beta$
for each $0 \le j \le k$.

Recall that I is the number of input partitions, define Ti ow as the set 
{u € V : val v (u) < 1 — ^M 1 ^ anc ^ us ca ^ ^ad all elements in T[ ow and 
good all elements that are not in Ti ow . As each element u in Ti ow contributes 
to the cost of a solution of k-Min-CC(Z) for at least \l{n-l)(l-val v (u)) < 
4^W/(ra — 1), a simple counting argument allows us to prove that there are 

at most 8 ° 7 " fc bad elements. 

For clarity's sake, we split Lemma 4.4 in [7] into two separate statements,
where the first statement (Lemma 3.3) is actually proved in the first part
of the proof of Lemma 4.4 in [7], while the second statement corresponds
to Lemma 4.4 in [7]. Those technical results show that (i) our algorithm
clusters almost optimally all good elements and (ii) all good elements in
Large are optimally clustered, pending a condition on various parameters
that will be proved at the end of the section (for the definition of Large
and Small see Algorithm 1). More precisely, Lemma 3.3 states that misplaced
good elements must belong to some small sets (which in turn implies that
the majority of good elements must be optimally clustered).

Lemma 3.3. Let u be an element in Ci \ Ti ow but not in T>i \ Ti ow . Then 
u G Vj, for some j ^ i, and |Pj| < 2(^p- + (3)n + 1. 

Proof. The proof is the same as in [7J, except for the observation that, 

by our definition of pval and since each pair of elements involving u is 

correctly co-clustered when u is in either T>{ or T>j, pvalP (u, j)+pval D (u, i) < 
2 _ tflPil+lPjl-i) n 

l(n— 1) 

Lemma 3.4. Let i be an element in Large. If ^ — 7 ra2 |^~ry > 2 (A; + 
!) {(m? + ^ n + and 2( -2bh +P)n + k<%- then Q \ T low = 



S 



Proof. Let x G V \ T[ ow . W.l.o.g. we can assume that x G C\ \ T\ ow and 
leDjU 7/ ow . First we will prove that C\ \ T[ ow CP^ Ti ow . Assume to 
the contrary that there exists a y € C\, y £ T>i,Ti ow , therefore (w.l.o.g.) 
y € X>2- By Lemma 13.31 and since there are at most sets in D, then 
Id \ (P x u r lott )| < 2(^ + pk)n + k. 

Since d \ (Pi U T ioiu ) = (C x \ V x ) \ T low = (d \ T low ) \ V x then \V X \ > 

Icat^imca^ut^)! > iCilWTiwiWdWhuTu^i > 

2( 2M + 0*0" + But M " 7^ 2 i^y " 2(^ + 0*0" + k > 2(-^ + /?)n + 1, 
which contradicts |X>i| < 2(^7 + /3)n + 1. In fact ^ - jn 2 j^^ ~ 2 (2M + 
/3&)n + k > 2(^7 + 0)n + 1 can be rewritten as ^ - jn 2 ^0^ > 2(k + 

1 )(( 2 W + ^ + 1 )- 

Now we know that C\ \ T[ om CP^ T\ ow and we would like to prove that 

C\\Ti ow D X>i\Tz olu , along the same lines as for the first part. Assume to the 

contrary that there exists a y € T>x, y Ci,Ti ow , therefore (w.l.o.g.) y £ Ci- 

Again by Lemma [373], both £>i and T>2 have at most 2{-^j+(5)n+l elements. 

Notice that C\ \ Ti ow C 2?i, since Ci \Ti ow C \ T[ ow , moreover Ci is large, 

therefore |d| > ^. By the value of \T low \, + /3)n + fc > ^ - 

which does not hold by hypothesis. □ 

Now we are able to show that there is a solution where some sets are 
exactly the large sets in ClusMin and whose cost is not much larger than 
the optimum. This fact justifies the recursive step of the algorithm. The 
condition under which the lemma holds will be proved at the end of the 
section. 

Lemma 3.5. If l(n - l)\Ti ow \ (2(3 + j^ryj < §7n 2 , then there exists a 

solution $F = \{F_1, \ldots, F_k\}$ such that the cost of $F$ is at most
$\gamma n^2 (1 + \varepsilon/2)$ and $F_i = C_i$ for each $i$ in Large.

Proof. Let $F$ be the solution consisting of all large sets in ClusMin and
where all remaining elements are partitioned as in $\mathcal{D}$. Clearly
the only pairs of elements that might not be partitioned in $F$ as in
ClusMin are the ones containing at least one element of $T_{low}$, by
Lemma 3.4. By the definition of $val$,
$cost(F) - cost(\mathcal{D}) \le l(n - 1) \sum_{x \in T_{low}} (val^{\mathcal{D}}(x) - val^{F}(x))$.

We have to consider two different cases, depending on the fact that 
x ^ T[ ow belongs to sets Cj, T>i for a certain i, or not. In the first case 
w.l.o.g. x is in both C\ and T>\ the set of pairs that are different in Clus- 
Min and in T>, are only pairs of the form (x,y) where y € T[ ow , which 
in turn implies that val F {x) > val T> {x) — j^rfy- In the second case we 
can assume w.l.o.g. that x € C\ and x G T>2- Applying Lemma 
we know that pval v (x,y) > val v (x) — 2/3. Also notice that in D(x,l) 
and F, the element x belong to the same set therefore, just as for the 
first case, val F {x) > val v ^ x ' 2 \x) - but val v( - x ^(x) = pval v (x,2). 

Combining all inequalities we obtain val F (x) > pval T '(x,2) — > 

vat D {x) — 2/3 — j^rfy, where the last inequality comes from Lemma 

In both cases we can say that val F (x) > pvafP (x,2) — ^-n — va l V ^- 



IT 

2/3 

~~ l(n°-Vi ■ ^ n hrimediate consequence is that cost(F) — cost(V) is at most 



Z(n-l) 

J(n-l) ' 

Kn-l)E, 6 T ;otu H F (^)-^(^)) < /(n-l)|TU(2/3 + |hy)- The 
claim follows since Z(n - l)|T i£W | (2/3 + j^rfy) <^n 2 e/2. □ 

Since the partitions $F$ and ClusMin are the same for all pairs where at
least one element is in a large set of ClusMin, an immediate consequence is
that the solution returned by the algorithm has cost at most
$\gamma n^2 (1 + \varepsilon/3)(1 + \varepsilon/2)$, which is at most equal
to $\gamma n^2 (1 + \varepsilon)$ for any sufficiently small $\varepsilon$.
The following technical result completes our proof by showing that the
hypothesis of Lemma 3.5 holds. The proof is a mechanical consequence of the
values of $\beta$, $|T_{low}|$ and $\varepsilon'$.
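The phrase "sufficiently small" can be made explicit with a short computation (ours, not part of the original argument): for $\varepsilon > 0$, the product of the two factors is at most $1 + \varepsilon$ exactly when $\varepsilon \le 1$.

```latex
\[
\left(1 + \tfrac{\varepsilon}{3}\right)\left(1 + \tfrac{\varepsilon}{2}\right)
= 1 + \tfrac{5}{6}\varepsilon + \tfrac{1}{6}\varepsilon^{2}
\;\le\; 1 + \varepsilon
\iff \tfrac{1}{6}\varepsilon^{2} \le \tfrac{1}{6}\varepsilon
\iff \varepsilon \le 1 .
\]
```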

Lemma 3.6. l(n - l)\T low \ (2/3 + jg^) < f 7 n 2 . 

Proof. Since \T[ ow \ < 4 ° 7 " fc and (3 = 2 o-wok' 2 l ' ^ suffices to prove that l(n — 

(20OT + F^S) < ! V that is equivalent to (^ + + ;{ 

en. Since we are only interested in instances where the algorithm of [3] 
fails to provide a (1 + e) approximation ratio, we can assume that 7 < e' = 
128-2W ' consequently it suffices to prove that MlLdl^i (^J^ + 128 .2 2 °ff_i )fc 4 

en that is equivalent to 4 ("~ 1 ) (^ + 64 ;^_ 1 ) J < w which in turn is equiva- 
lent to 9n < 80ln + 4n which is trivially true. □ 

To complete the section and the analysis of the algorithm, we need to
prove that the assumptions that we have made in some of the previous
lemmas actually hold. The proofs are mechanical and quite tedious
consequences of the values of $\beta$, $\gamma$ and $\varepsilon'$.

Lemma 3.7. Ifn > 16k 2 then % - 7 n 2 jg^y > 2(k + 1) ((^ + 0)n + l) . 

Proof. By the values of 7 and /3, and since we can assume that 7 < e' = 
128-20^ fc 4 ' tne inequality can be rewritten as ^ - 12 8-20^ 4 n2 i(n-i) > + 
!) {(m* + udom ) n + !) which can be simplified as ^ - 64^n 2 I( ^ Ty > 
2(fc + 1) ((2^7 + i28-20 2 fc 4 ) n + -O - Since < 2, it suffices to prove that 

f (i-Tfrfon) > 2 ( fc + 1 )((2(^ + T28i(PF) n + 1 )- As fc >* > 2 and e is 
t* n y> i6-20fc? ^ TDM' therefore we are only interested in proving that ■ 
§ > 2(k + 1) ((^ + 128 . 2 V fc 4> + 1) which is equivalent to ^ ■ f > 
(k + l)(j^ + 6 4.oq2 ^4 ) n + 2(A; + 1). Again, < 2, therefore it is sufficient 
to prove that • ^ > + 32 2 ^ fc3 )ra + 2(k + 1) which is equivalent 

to TM0 ■ f > + 2 ( fc + !)■ Since fc ^ 2 ' 32l^F < M5' hence [t 

suffices to prove that ^ > 2(/c + 1) which is an immediate consequence of 

the assumption n > 16k 2 . □ 

Lemma 3.8. If n > 16k 2 then 2(-^ + 0)n + k < £ - . 

Proof. By the values of 7 and /? the inequality can be rewritten as 2(^jp- + 

128^F ) n+fc < t- 80 f 2 i 28 . 2 V^ which can be simplified as f + + 1^) + 

2fc < |. As fc, Z > 2, it is immediate to notice that ^ + 32 2 ^ fc a + 16 , 2o;fc < 4, 
therefore it suffices to prove that 2k < which is an immediate conse- 
quence of the assumption n > 16k 2 . □ 



4 NP-hardness 

In this section we prove that 2-Min-CC(3) is NP-hard. From the NP-hardness
of 2-Min-CC(3), it is easy to show that $k$-Min-CC(3) is also NP-hard for
any fixed $k$. Our proof consists of a reduction from the NP-hard Min
Bisection problem (MIN-BIS) to 2-Min-CC(3). The MIN-BIS problem,
given a graph $G = (V, E)$, asks for a partition of $V$ into two equal-sized
sets so that the number of edges connecting vertices in different sets is
minimized.

For our purposes, in this section we give a different, but equivalent,
definition of the cost of a solution $\pi$ of Minimum Consensus Clustering
over instance $\Pi$:

$$\sum_{i<j} \left( r_\pi(i,j) \, d_\Pi(i,j) + (1 - r_\pi(i,j)) \, s_\Pi(i,j) \right), \qquad (1)$$

where $r_\pi(i,j) = 1$ iff $i, j$ are co-clustered in $\pi$, otherwise
$r_\pi(i,j) = 0$.

The above formula will be used in this section to define the cost of a set
$P$ of pairs in a solution $\pi$ as
$\sum_{(i,j) \in P} \left( r_\pi(i,j) \, d_\Pi(i,j) + (1 - r_\pi(i,j)) \, s_\Pi(i,j) \right)$.
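The equivalence between formula (1) and the definition of cost as a sum of symmetric difference distances can be checked numerically on small instances. The sketch below uses hypothetical helper names (`labels_of`, `cost_by_distances`, `cost_by_pairs`) and toy data of ours.

```python
from itertools import combinations


def labels_of(partition):
    """Map each element to the index of the set that contains it."""
    return {x: b for b, block in enumerate(partition) for x in block}


def cost_by_distances(candidate, partitions):
    """Cost as the sum of d(candidate, pi_i) over the input partitions."""
    cand = labels_of(candidate)
    total = 0
    for p in partitions:
        lab = labels_of(p)
        for i, j in combinations(sorted(cand), 2):
            if (cand[i] == cand[j]) != (lab[i] == lab[j]):
                total += 1
    return total


def cost_by_pairs(candidate, partitions):
    """Cost as in formula (1): r * d(i, j) + (1 - r) * s(i, j) per pair."""
    cand = labels_of(candidate)
    labs = [labels_of(p) for p in partitions]
    total = 0
    for i, j in combinations(sorted(cand), 2):
        s = sum(1 for lab in labs if lab[i] == lab[j])
        d = len(labs) - s
        r = 1 if cand[i] == cand[j] else 0
        total += r * d + (1 - r) * s
    return total


# Toy instance (hypothetical data).
inputs = [[{1, 2}, {3, 4}], [{1, 2, 3}, {4}], [{1}, {2, 3, 4}]]
candidate = [{1, 2}, {3, 4}]
by_distances = cost_by_distances(candidate, inputs)
by_pairs = cost_by_pairs(candidate, inputs)
```

The two functions only differ in the order of summation (over partitions first, or over pairs first), which is why they always agree.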
Given an instance $G = (V, E)$ of MIN-BIS, where $|V| = n$ and $|E| = m$,
we build an instance of 2-Min-CC(3) as follows.

First we define the universe set $U$. For each $v_i \in V$, we define a set
of $n^4$ elements $X_i = \{x_{i,1}, \ldots, x_{i,n^4}\}$, and a set of $n$
elements $Y_i = \{y_{i,1}, \ldots, y_{i,n}\}$. The universe set is
$U = (\bigcup_i X_i) \cup (\bigcup_i Y_i)$. Next we define the three input
partitions of 2-Min-CC(3), $\Pi = \{\pi_1, \pi_2, \pi_3\}$. Partitions
$\pi_1$ and $\pi_2$ are identical and consist of the $n$ disjoint sets
$X_i \cup Y_i$, with $i = 1, \ldots, n$. The partition $\pi_3$ contains the
sets $X_i$; moreover, for each edge $(v_i, v_j) \in E$, in $\pi_3$ we have
the set $\{y_{i,h}, y_{j,l}\}$ consisting of two elements taken respectively
from $Y_i$ and $Y_j$ (the actual elements taken are not important, provided
that $\pi_3$ is a partition of the universe set, which is trivial to obtain
since each vertex has fewer than $n$ incident edges). Finally, in $\pi_3$ we
have a singleton for each element of $\bigcup_i Y_i$ that is not in a
two-element set according to the previous rule.
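The construction can be sketched in a few lines of Python. We assume a simple graph whose vertices are numbered from 0 to $n - 1$ and whose edges are given as pairs of distinct vertices; the function name `build_instance` and the encoding of the elements as tuples are choices of ours.

```python
def build_instance(n, edges):
    """Build the three input partitions of the reduction from a graph with
    vertices 0..n-1 (assumed simple) and the given list of edges."""
    X = [{("x", i, a) for a in range(n ** 4)} for i in range(n)]
    Y = [[("y", i, b) for b in range(n)] for i in range(n)]

    # pi_1 and pi_2 are identical: one set X_i union Y_i per vertex.
    pi_1 = [X[i] | set(Y[i]) for i in range(n)]
    pi_2 = [set(block) for block in pi_1]

    # pi_3: the sets X_i, one two-element set per edge, then singletons.
    pi_3 = [set(x) for x in X]
    next_free = [0] * n  # next unused element of each Y_i
    used = set()
    for i, j in edges:
        y_i, y_j = Y[i][next_free[i]], Y[j][next_free[j]]
        next_free[i] += 1
        next_free[j] += 1
        pi_3.append({y_i, y_j})
        used.update((y_i, y_j))
    for i in range(n):
        for y in Y[i]:
            if y not in used:
                pi_3.append({y})
    return pi_1, pi_2, pi_3


# Toy graph (hypothetical): a cycle on 4 vertices.
pi_1, pi_2, pi_3 = build_instance(4, [(0, 1), (1, 2), (2, 3), (3, 0)])
```

Each $Y_i$ has $n$ elements while a vertex of a simple graph has at most $n - 1$ incident edges, so a fresh element of $Y_i$ is always available for every edge.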

Observation 4.1. Since all the elements in $X_i$ are co-clustered in all
input partitions, each $X_i$ is contained in a set of the optimal solution.

The previous observation allows us to restrict our attention to
solutions where all elements of $X_i$ are co-clustered. Consider a solution
$\pi = (S_1, S_2)$. The cost of $\pi$ can be expressed as the cost of all
pairs of elements in $\pi$. We can split the cost of $\pi$ into four parts:

1. the cost of pairs of elements both belonging to $\bigcup_i X_i$,

2. the cost of pairs of elements with exactly one element belonging to
$\bigcup_i X_i$,

3. the cost of pairs of elements in $Y_i \times Y_j$ with $i \ne j$,

4. the cost of pairs of elements both belonging to a set $Y_i$.

We will call balanced a solution $(S_1, S_2)$ where both $S_1$ and $S_2$
contain exactly $\frac{n}{2}$ sets $X_i$. The following lemma states that
optimal solutions must be balanced.

Lemma 4.2. Let $\pi = (S_1, S_2)$ be a solution of 2-Min-CC(3); then the
cost of $\pi$ is at most
$\frac{3}{4}n^{10} - \frac{3}{2}n^{9} + 3n^{7} + \frac{3}{2}n^{4} - \frac{3}{2}n^{2}$
if and only if $\pi$ is a balanced solution.

Proof. Notice that the total cost of case 2) is at most
$3n^2 \cdot n^5 = 3n^7$ as $|\bigcup_i Y_i| = n^2$ and
$|\bigcup_i X_i| = n^5$, while the sum of total costs of cases 3) and 4)
is at most $3\binom{n^2}{2} = \frac{3}{2}n^4 - \frac{3}{2}n^2$.

Let $z$ be the number of sets $X_i$ included in $S_1$. The cost of the
pairs of elements both belonging to $\bigcup_i X_i$ is
$C(z) = 3\left(\binom{z}{2} + \binom{n-z}{2}\right) n^8$. Indeed, only
the pairs of elements in distinct sets $X_i$ that are co-clustered in $S_1$
or in $S_2$ contribute to the cost, as no pair of elements belonging to two
distinct sets $X_i$ is co-clustered in an input partition. The minimum of
$C(z)$ is attained for $z = \frac{n}{2}$. For any other $z$, the value of
$C(z)$ is at least equal to $C(\frac{n}{2} - 1)$.

Since $C(\frac{n}{2}) = \frac{3}{4}n^{10} - \frac{3}{2}n^{9}$, the total
cost of a balanced solution is at most
$\frac{3}{4}n^{10} - \frac{3}{2}n^{9} + 3n^{7} + \frac{3}{2}n^{4} - \frac{3}{2}n^{2}$,
while the total cost of an unbalanced solution is at least
$C(\frac{n}{2} - 1) = \frac{3}{4}n^{10} - \frac{3}{2}n^{9} + 3n^{8} > \frac{3}{4}n^{10} - \frac{3}{2}n^{9} + 3n^{7} + \frac{3}{2}n^{4} - \frac{3}{2}n^{2}$. □
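The behaviour of $C(z)$ used in the proof can be verified numerically for a small even $n$. This is only a sanity check under our own naming (`cost_within_x`); it takes the closed form $C(z) = 3\left(\binom{z}{2} + \binom{n-z}{2}\right) n^8$ as given.

```python
from math import comb


def cost_within_x(z, n):
    """C(z): cost of the pairs inside the union of the X_i when z of the
    n sets X_i are placed in S_1 and the remaining n - z in S_2."""
    return 3 * (comb(z, 2) + comb(n - z, 2)) * n ** 8


n = 6  # any even n works; 6 is an arbitrary small example
values = {z: cost_within_x(z, n) for z in range(n + 1)}
balanced = values[n // 2]
closest_unbalanced = values[n // 2 - 1]
```

Moving a single set $X_i$ from one side to the other of a balanced solution increases this part of the cost by exactly $3n^8$, which dominates the at most $3n^7 + \frac{3}{2}n^4$ contributed by all the remaining pairs.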

By Lemma 4.2 we can consider only balanced solutions. A balanced
solution $\pi$ is called standard if, for each $i$, $X_i$ and $Y_i$ are
contained in the same set of $\pi$. The following lemma shows that we can
consider only standard solutions.

Lemma 4.3. Let tt = (Si, £2) &e a balanced solution of 2-MIN-CC('3 y ), 
i/ien f/ie cost of tt is at most |n 10 — |ra 9 + |n 7 — ^n 6 + |n 4 + ^n 3 — ^n 2 iff 
tt is a standard solution. 

Proof. Let 7r = (S±, S2) be a balanced solution, then the total cost of pairs 
of elements with exactly one element belonging to UXi is at most |ra 7 — ^n 6 
as all pairs in Xi x Yj, with % 7^ j, contribute with a cost 3 if and only if 
Xi U Yj is contained in a set of tt, and have no cost otherwise. At the same 
time all pairs in Xi x Yi have cost 1 in any standard solution, as Xi U Yi 
are a set of two input partitions, while in the third input partition, ^3, no 
pairs in Xi x Yi are co-clustered. If tt is a standard solution, then the total 
cost of pairs of elements in Yi x Yj with i ^ j is |re 4 as only half of such 
pairs are co-clustered in a standard solution. Following the reasoning of 
the proof of Lemma 14.21 with our new estimates of cases 2) and 3) , it is 
immediate to notice that, if tt is a standard solution, then its cost is at most 
|n 10 -|n 9 + |n 7 -in 6 + in 4 +n(2) = |n 10 -|n 9 +|n 7 - ±n 6 +±re 4 + ±ra 3 - \n 2 . 

Now assume that tt is not a standard solution, that is there exists an 
element y € Y% that is not clustered together with all elements of Xi. Again, 
following the same lines of the proof of Lemma 14.21 the cost of tt is at least 
|n 10 — |n 9 +|n 7 — ^n 6 +|n 4 +n 4 , as all pairs in {y}xXi have a cost 2, instead 
of 1 as in a standard partition. Since |n 10 — |n 9 + |n 7 — ^n 6 + |n 4 + n 4 > 
|n 10 — |n 9 + |n 7 — \n & + \n 4 + ^n 3 — \n 2 , the lemma follows. □ 

Given a standard solution $\pi$, by construction of the reduction, with each
edge $(v_i, v_j) \in E$ we associate a pair $\{y_{i,h}, y_{j,l}\}$. Let us
denote by $F$ the set of such pairs, and by $F_c$ the subset of all pairs in
$F$ that are co-clustered in $\pi$. We conclude the proof with the following
theorem.

Theorem 4.4. Let G = (V, E) be an instance of MIN-BIS, and let (tti, 7T2, ^3) 
be its associated instance of 2-MIN-CCf3j. Then (tti, tt2, ^3) has a solution 
of cost f n 10 - §n 9 + 3/4ra 7 - in 6 + ±n 4 + fra 4 - in 3 - \n 2 + (|F| - k) - k 
if and only if G has a bisection of cost k . 

Proof. Let (Vi, V2) be a bisection with cost k. Then let S\ be the set 
Ui 6 vi(.Xi U li), and let 5*2 = Uj g y 2 (Xj U 3^). By construction (Si,^) has 
cost f n 10 - |n 9 + f n 7 - |n 6 + ±n 4 + |n 4 - ±ra 3 - ±ra 2 + (|F| — k) — k. 

Now let (Si, S 2 ) be a solution of 2-MIN-CC(3) with cost | n 10 /4- §n 9 + 
f n 7 - in 6 + \n 4 + | n 4 - ±n 3 - ±n 2 + (|F| - k) - k. By Lemmas 021 S3] 
(Si,S2) must be a standard solution. 

Recall that the cost of a solution it = (Si, S 2 ) can be expressed as the 
cost of all pairs of elements in 7r, such a cost can be split into parts 1), 2), 
3) and 4). Moreover, following the proof of Lemmas 14.21 14.31 we know that 
the total cost of case 1) is |n 10 — f fi 9 , the total cost of case 2) is |n 7 — ^n 6 . 
By direct inspection the total cost of case 4) is ^n 3 + \n 2 . 

We still have to consider case 3), that is the cost of pairs {yi,q,Uj,t), 
with j 7^ i. We have to distinguish three cases, according to the fact that 
(Ui,q,yj,t) £ F — F c (in this case the cost is 1), (yi t q,yj t t) £ F c (in this case 
the cost is 2), (yi t q,yjj) £ F (in this case the cost is 3 if y^q and y^t are 
co-clustered, and otherwise. Therefore the total cost of case 3) can be 
written as n 2 (") +\F- F c \ - \F C \. 

Summing up the costs of the four cases we obtain a total cost |n 10 — 
|n 9 + |n 7 -in 6 + |n 4 + |n 4 -in 3 -in 2 + |F| -2|F C |. Consequently, taking 
into account the initial hypothesis, $|F_c| = k$. Let $(V_1, V_2)$ be the
solution of $G$ where $V_1 = \{v_i \mid X_i \subseteq S_1\}$ and
$V_2 = V - V_1$. By construction the number of edges of $E$ crossing the
bipartition $(V_1, V_2)$ is equal to $|F_c|$ which, in turn, is equal to
$k$, completing the proof. □

5 Conclusions 

In this paper we have studied the Minimum Consensus Clustering problem
when the output partition contains at most a constant number of sets.
We have shown that the MinDisAg algorithm [7] can also be applied to our
problem, hence showing that its applicability is not restricted to
unweighted problems. Moreover we have proved that the same problem is
NP-hard even on instances of three input partitions, thereby justifying our
reliance on polynomial time approximation algorithms.

In our opinion the main idea behind the MinDisAg algorithm could be
applied to some more general versions of both Minimum Consensus
Clustering and Minimum Correlation Clustering than the ones studied
here and in [7].